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Abstract. At macroevolutionary time scales, and for a constant mutation rate, there 
is an expected linear relationship between time and the number of inferred neutral muta- 
tions (the "molecular clock"). However, at shorter time scales a number of recent studies 
have observed an apparent acceleration in the rate of molecular evolution. 

We study this apparent acceleration under a Jukes-Cantor model applied to a ran- 
domly mating population, and show that, under the model, it arises as a consequence 
of ignoring short term effects due to existing diversity within the population. The ac- 
celeration can be accounted for by adding the correction term /ioe -4/it / 3 to the usual 
Jukes-Cantor formula p(t) = |(1 — e~ 4M */ 3 ), where ho is the expected heterozygosity in 
the population at time t — 0. The true mutation rate n may then be recovered, even if 
ho is not known, by estimating /i and ho simultaneously using least squares. 

Rate estimates made without the correction term (that is, incorrectly assuming the 
population to be homogeneous) will result in a divergent rate curve of the form /idi V = 
[i + C/t, so that the mutation rate appears to approach infinity as the time scale ap- 
proaches zero. While our quantitative results apply only to the Jukes-Cantor model, 
it is reasonable to suppose that the qualitative picture that emerges also applies to 
more complex models. Our study therefore demonstrates the importance of properly 
accounting for any ancestral diversity, as it may otherwise play a dominant role in rate 
overestimation. 



1. Introduction 



It is well-known since Kimura (ll968r Il983[ ) that over time scales of tens and hundreds 
of millions of years the longer term rate of molecular evolution, k, is expected to be vir- 
tually the same as the short term rate of neutral mutations /i, t hat is, k ~ /i. However , 
at shorte r time scales a numb e r of recent s t udies ( for example Garcia-Morenol ( 2004 ): 
Ho et all (l2005f ): IPennvi fl2005h: iMillar et all (l2008h: iHenn et all fl2009h : although there 
were some earlier indications ( Fitch and Atchlevl . 19851 )) have observed an apparent ac- 
celeration in the rate of molecular evolution. This has led to debate as to the underlying 
causes of the apparent acceleration, and raised important questions as to how long it 
persists and how to correct for it. 

A number of factors that may contribute to this apparent rate acceleration have been 
proposed, and we refer the reader to Ho et al. (1201 If) for a recent review. One such facto r 
is the effect of ancestral polymorphism (IPeterson and Masell . 120091 : ICharlesworthl l2010f ). 
Indeed, it is clear that, if ancestral polymorphism is present and not properly accounted 
for, then an apparent rate acceleration will inevitably be seen. Consider a comparison 
between two sequences drawn from a population at time t = which is incorrectly assumed 
to be homogeneous. Any differences between the sequences due to polymorphism will 
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appear to be mutations that have occurred in zero time, leading to an apparent infinite 
substitution rate at time zero. Moreover, on sufficiently short time scales the average 
difference due to polymorphism will dominate any changes due to substitution, and to 
first order the expected difference between sequences at times and t will be given by 
p(t) = C+fi't. (Here fx' will differ slightly from the mutation rate fx, as some mutations will 
act to decrease polymorphism.) Naively dividing by t to recover ft, under the assumption 
that the first order approximation is given by p{t) = fit, then gives an apparent divergent 
rate curve ft' + C/t, suggesting that unaccounted ancestral polymorphism will contribute 
a term of the form C/t to rate estimates. This a priori estimate is in sharp contra st wit h 
the use of exponential decay curves of the form k + fte~ xt (for example, by Ho et al. f 2005[ ) ) 
to fit estimated rates. 

The effect of ancestral polymor phism on sequence divergence has been studied quan- 
titatively by Peterson and Masel (l2009l ). who calculate the expected divergence between 
two sequences as a function of time under the Mor an m odel with selectio n. Th eir model 
provides a reasonable fit to data from Genner et al. (120071 ) and Henn et al. ( 120091 ). confirm- 
ing that the observed rate acceleration is consistent with an underlying constant mutation 
rate. However, their model does not admit a simple closed form analytic expression, and 
this makes it difficult to correct for the apparent rate acceleration or quantify its duration. 

The purpose of this paper is to quantify the effect of ancestral polymorphism under 
a Jukes-Cantor type model applied to a randomly mating population. This is a more 
restrictive setting than that considered by Peterson and Masel, but the benefit of working 
in this simpler setting is that we are able to obtain simple and explicit analytic expressions 
for quantities of interest. In particular, we show that, under the model, ancestral polymor- 
phism contributes a term /ioc -4 ^ 3 to the usual Jukes-Cantor formula pit) = 1(1— e -4 ^*/ 3 ), 
where ho is the expected heterozygosity at time t — 0, and confirm that comparisons be- 
tween sequences made under the incorrect assumption ho = lead to a divergent rate 
curve of the form fi& v = fi + C/t. This has consequences at long time scales as well as 
short, as C/t tends to zero comparatively slowly as t tends to infinity (in particular, more 
slowly than any exponential decay), and so the effects of ancestral polymorphism may be 
relatively long lived. We show however that the true mutation rate ft may still be recov- 
ered from several observations even if the level of heterozygosity at time is unknown, 
by estimating ho simultaneously with fi using least squares. We show further that our 
correction term also applies to other population structures (for example, island models) 
under an appropriate assumption on ho- 

These results show that ancestral polymorphism, where present and unaccounted for, 
will be a significant contributing factor to mutation rate overestimation, with the mag- 
nitude of the effect approaching infinity as the time scale shrinks to zero. Our principal 
finding then is that an apparent rate acceleration at short time scales is consistent with - 
and indeed to be expected under — a constant mutation rate, in the presence of ancestral 
polymorphism that is not pr operly taken into account. This finding is in agreement with 
that of Peterson and Masel fj200flh . 

Ancestral polymorphism is just one of several processes that are thought to contribute 
to the apparent rate acceleration, and is unlikely to be the sole contributing factor. How- 
ever, it is clearly vital that its effect be quantified and accounted for if we are to obtain 
meaningful and accurate rate estimates on short time scales, and fully resolve the question 
of the apparent acceleration. 
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2. The model, and our main result 



We consider a population of N individuals evolving under a random mating process, 
with discrete generations, where at each generation the allele of each individual is re- 
placed witlia copy of the allele from the previous generation, chosen uniformly at ran- 



dom (IWrightl . 119311 ). For simplicity we restrict our attention to haploid populations. 



We ass ume that alleles are r-s tate characters evolving under the r-state Jukes-Cantor 



model ( iJukes and Cantorl . Il969l ). with an instantaneous mutation rate of /i mutations per 
individual per generation per site. All mutations are assumed to be neutral. We orient 
time in the forwards direction, so that populations at times t > descend from the pop- 
ulation at t = 0. We refer to the population at time as the reference population, and 
the population at a given time t > of interest as the contemporary population. 

Consider a pair of individuals chosen uniformly at random, one from each of the ref- 
erence and contemporary populations, and let P(t) be the probability that they have 
different character states at a fixed site. Suppose that the distribution of character states 
at that site in the reference population is given by n = (7Ti, . . . , 7iy). (Note that n as used 
here is the distribution of states within the population at a given site, rather than the 
equilibrium distribution of the Jukes-Cantor model, which is the distribution of states 
across all sites within an individual.) Then 

Theorem. The probability that uniformly randomly chosen members of the reference and 
contemporary populations have different states at a given site is given by 

(1) P(t) = hoe-^* + ^(1 - e-^), 

where 

r 

h = p(o) = - ^) = J2 E w 

i=l i jy^i 

is the expected heterozygosity att — 0. In particular, for r = 4 (as for nucleotides) we get 

P{t) = hoe-fc + Kl-e-***). 

The proof of the theorem is given in the Appendix. Note that the second term in 
equation ([I]) is the standard probability under the r-state Jukes-Cantor model that a net 
change takes place at the given site, when the contemporary sequence directly descends 
from the reference sequence — this second term is the usual form of the equation for 
longer time periods. Thus the theorem adds the correction term h e~ rfJ,t ^ r ~ 1 ^ when the 
sequences being compared are not necessarily direct descendants. This accounts for the 
variation that is present in the reference population from past mutations that have not 
yet been either fixed or lost. The two expressions P(t) and the Jukes-Cantor mutation 
probability are asymptotic, in the sense that 

P(«) Ke-^ ^ 



as t — > oo. 

Note that the theorem requires that the population at the earlier time be used as the 
reference population. In particular, if populations at several different times are to be 
compared, then this should be done by comparing each population against the oldest 
population, to ensure that all pairwise comparisons made involve the same value of the 
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initial heterozygosity ho- This requirement may be relaxed if there is reason to believe 
that the heterozygosity h is constant, or nearly so. In general, we expect ir (and hence h) 
to depend on t, but as t — > oo we also expect 7r to approach an equilibrium distribution 
TToo, where the rate at which new mutations are introduced balances the rate at which 
mutations are fixed or lost by the mating process. If this equilibrium is assumed to have 
occurred then equation (TfJ) may be written in the form 

(2) P(t) = Jwf A* + ==i(l - e-^"*), 

where is the expected heterozygosity at equilibrium. Under this assumption any two 
populations at times t\ and t 2 may be compared, using t = \ti — t 2 \. 



2.1. Recovering the mutation rate from observed data. We now demonstrate how 
the mutation rate fi may be estimated from observations of P(t). If the expected het- 
erozygosity is assumed to have reached equilibrium, then may be substituted for ho 
throughout. 

Equation ([I]) may be rearranged to the form 

(3) P(t) = (l - (1 - ^A Q )e-^) . 

If ho is known then we may estimate [i as 

r - 1 1 - -hPif) 

(4) /*= — -log - r -\ y . 

rt 1 -rhn 

r— 1 u 

On the other hand, if ho is not known then it will need to be estimated simultaneously 
with \i. This may be done using a least squares fit to 

log (1 - j£iP(*)) = log(l - ^h ) - ^t; 

if the least squares line is y = mt + b then we recover \x and ho as 

^ = - r -i 1 m, /, = r^i(i_e 6 ). 



2.2. The divergent rate curve. A divergent rate curve arises when we either omit or 
use an incorrect value for ho to estimate /i. In particular, if we ignore the existing diversity 
and thus use h = we get the estimate 

(r-l)log(l-^jP(t)) 
(5) /x div = 

(r- 1)^(1-^/10) 



rt 



= H 

which adds the divergent term ^— L to the estimate above. Thus, if the 

variation present within the reference population is ignored we obtain a divergent rate 
curve of the form /z div = + C/t, where C = — log (l — -zihoj is a positive constant 
independent of t. This rate estimate tends to infinity as t — > 0, and thus becomes 
increasingly inaccurate on shorter and shorter times scales. 
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3. Results of simulations 

Figure 1A shows results of simulations over microevolutionary time scales, and exhibits 
the phenomena described above. Most importantly, if we do not correct for existing genetic 
diversity there is the apparent acceleration at shorter times, even though the basic neutral 
mutation rate [i is kept constant. This first main point then is that the apparent increase 
in "rate" (the divergent rate curve) is obtained at shorter periods. However, it is then 
important that, using equation (j3J) and the value of ho, the value of /i can still be estimated 
accurately at these shorter intervals. This is certainly as expected; it has always been 
assumed that the mutation rate /i was basically constant, that it was independent of time. 
Thus the conclusion from Figure 1A is that by explicitly considering genetic variability 
in the ancestral (reference) population — either by using an a priori estimate of it (blue 
curve), or by estimating it simultaneously using least squares (dashed black line) — it is 
possible to recover the mutation rate [i. In practice, it may be that knowledge of \x (from 
longer term studies) may also be important in understanding population structure and 
its change over time. 



3.1. The simulations. The simulations were run using the statistical package R ( 120111 ). 
Each simulation begins with a haploid population containing 70 individuals having allele 
A and 30 individuals having allele B, giving a true h of 0.42. There are four allele types 
in total, and mutations from any given allele to any of the other three take place at 
equal rates, as in the Jukes-Cantor model for DNA substitutions. This population is then 
evolved for 1000 generations. In each generation, reproduction is simulated according 
to the Wright-Fisher process, whereby each of the 100 individuals making up the new 
population is chosen randomly with replacement from the previous generation; each of 
these new individuals is then subjected to mutation at the rate of \i = 0.001 mutations per 
individual per generation. For the final step in each generation, a pair of individuals - 
one from the current population and one from the initial population — is picked randomly 
and compared; if they have different alleles, a counter for that generation is increased by 
1. The entire simulation was repeated 2000 times, with counters accumulating across 
runs: consequently the final counter value for generation i, after dividing by 2000, is an 
estimate of the probability that an individual picked randomly from the initial population 
differs from an individual picked randomly from generation %. These relative frequencies 
are used to estimate the mutation rate parameter fi in several different ways, as we now 
describe. 



3.2. Estimating //. Three different approaches to estimating // are shown in Figure 1A. 
The dashed pink curve shows the result of estimating \x while incorrectly assuming ho = 
(the assumption made by most previous studies); the solid blue curve shows the result 
of estimating // assuming h to be its true value, 0.42. The difference is plain, particu- 
larly at early times when the probability of observing a difference is dominated by the 
heterozygosity of the initial population. To demonstrate the limited accuracy achievable 
using the incorrect assumption that h = 0, the \l curve that would be estimated from 
an infinite number of observations (instead of 2000) is shown as a dashed green line — it 
is scarcely better, indicating that sampling error is not the problem. The fact that the 
pink curve tracks the dashed green curve also indicates that, as expected, the simulation 
fits our theoretical divergent curve /idiv of equation ([5]). Note that the erratic behaviour 
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of the blue curve near is due to numerical instability, as the calculations there involve 
logs and ratios of numbers close to zero. 

Both of the above estimation procedures can estimate /i using a probability estimate 
from a single time point, but assume ho to be known a priori. Where this is not the case, 
ho can be estimated simultaneously with \x using least squares, which requires probability 
estimates from multiple time points. The result of estimating a single value for [i using 
all 1001 time points is shown as a horizontal dashed black line in Figure 1A. 

3.3. Other population structures. Our corrected formula extends to cases where the 
population consists of multiple subpopulations that interact via arbitrary migration rates, 
provided that the initial distribution of states is the same for each subpopulation. Fig- 
ure IB shows the accurate recovery of \x and h for a simulated population consisting 
of two islands, each containing 70 individuals having allele A and 30 individuals having 
allele B, that mix sporadically and asymmetrically: in each generation, the probability 
that an individual in subpopulation 1 comes from subpopulation 2 is just 1%, while the 
probability that an individual in subpopulation 2 comes from subpopulation 1 is 10%. 
Adding this limited kind of population structure does not alter the fact that the proba- 
bility distribution of states for the ancestor of an individual chosen randomly at time t 
remains equal to that for an individual chosen randomly from the initial population, so 
the probability pdiff that the ancestral and reference states differ is still given by ho- 

If the initial state distributions differ across subpopulations, then the probability that 
the ancestor of a randomly-chosen contemporary individual has a particular state is no 
longer constant, so h in equation (jTJ must be replaced with a function of time, Pdis(t). 
Figure 1C shows results from a simulation of one such scenario. There are two islands, 
each containing 100 individuals: the first contains 70 individuals having allele A and 30 
individuals having allele B, the second contains 25 individuals with each of the four alleles. 
Migration rates are as for Figure IB. The blue curve, which shows an attempt to estimate 
\x using equation (J4]) and assuming that h = 0.6675 (the probability that two individuals 
chosen randomly from the initial population have different states), produces a divergent 
rate curve because this more general population structure violates the assumption that 
Pdis is constant and equal to h . 

Despite the fact that pdiff is nonconstant when the subpopulations have different initial 
state distributions, our approach is still useful here. Under reasonable conditions on mi- 
gration probabilities, the probability that the ancestor of a randomly-chosen individual 
has a given state converges towards an equilibrium value as t — )■ oo, and so pdiff will ap- 
proach an equilibrium value also. In particular, when estimating [i and ho simultaneously 
via least squares, only the estimate of ho is affected by different initial state distributions 
across subpopulations: as expected, the dashed black line depicting the least squares esti- 
mate of \i in Figure 1C is close to the true value of fi. Since /i is usually the parameter of 
interest, this means that our model can still be used for inference with these more general 
population structures. Note that any edge-weighted graph describing mating probabili- 
ties within a population can be represented as a multi-island model in which each island 
contains a single individual. 

4. Conclusions 

The Jukes-Cantor model is one of the oldest and simplest substitution models, and 
the benefit of studying ancestral polymorphism in this simple setting is that we are able 
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Figure 1A: Estimated mutation rate 
N = 1 00, n = 0.001 , h = 0.42, 1 000 generations, 2000 repetitions 
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\i estimated using Eqn 4 (correctly assuming h = 0.42) 
True n = 0.001 

\i estimated as 0.001047 using least squares 

(h simultaneously estimated as 0.4179) 
|i estimated using Eqn 5 (incorrectly assuming h = 0) 
\i estimated from perfect data using Eqn 5 
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Figure 1. Estimated mutation rates against time for three different 
population structures. (1A): A single population initially containing 
70 individuals having allele A and 30 individuals having allele B; (IB): 
A 2-island model, both of whose subpopulations are initially the same 
as for (1A), with 1% migration per generation from island 1 to island 
2, and 10% migration per generation in the reverse direction; (1C): A 
2-island model, whose first subpopulation is the same as for (1A), and 
whose second subpopulation contains 25 individuals of each of the four 
states. The graphs show the results of simulations of four state data. 
Dashed red line: the value of the mutation rate /i used. 

Blue curve: /j estimated at each time point from simulated data trans- 
formed using equation (j3J) and the correct value for the ancestral het- 
erozygosity ho. 

Pink curve: simulated data transformed using the incorrect transform 

(equation fl5])) to estimate /x, giving a divergent rate curve. 
Dashed green curve (1A, IB): theoretical divergent rate curve, ob- 
tained by transforming perfect data (equation ([I])) according to equa- 
tion (jSj). This assumes an incorrect value of h Q = 0. 
Dashed black line: the least squares estimate of \i from the simulated 
data. This uses no knowledge of h or \i (unlike the blue curve). 
Note that the blue curve behaves poorly near 0, where the calculations are 
numerically unstable as they involve logs and ratios of numbers close to 
zero. 
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Figure 1B: Estimated mutation rate for 2-island model, 

equal initial distributions 
= 200, u, = 0.001 , h = 0.42, 1 000 generations, 2000 repetitions 
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H estimated using Eqn 4 (correctly assuming h = 0.42) 
True n = 0.001 

\i estimated as 0.001 01 5 using least squares 

(h simultaneously estimated as 0.4251) 
|i estimated using Eqn 5 (incorrectly assuming h = 0) 
\i estimated from perfect data using Eqn 5 
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to obtain explicit analytic expressions for quantities of interest, including the expected 
divergence between two sequences, and the uncorrected rate curve /Zdiv Moreover, many 
more complex and realistic models include the Jukes-Cantor model as a special case, 
so while our precise quantitative findings are only supported under the limited model 
considered here, there are clear qualitative implications for these more general models. 

Given the rapid advance in DNA sequencing technology we expect a large increase 
in sequences that have diverged in recent, intermediate and longer times. It is there- 
fore essential to be able to generalise results to a full range of divergence times — it is 
no longer appropriate to consider separately shorter term microevolutionary and longer 
term macroevolutionary studies. Our results show that a significant contributing fac- 
tor to the apparent acceleration at shorter times is the pre-existing diversity within a 
popul ation, and this can be estimated for a wide range of population sizes and struc- 
tures ( Charlesworth and Charlesworth . 2010l ). By considering the expected diversity in 



a population we show how the real rate of neutral mutation can be estimated at shorter 
times, using either an appropriate estimate of genetic diversity or data from multiple time 
points. As the timescales of traditional phylogenetics and population genetics draw closer 
together, we anticipate seeing an increasing emphasis on the careful handling of genetic 
diversity in phylogenetic analyses; the work presented here can provide a starting point 
for the exploration of more general models involving more complex substitution processes, 
time-varying population sizes, and other effects. 
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Figure 1C: Estimated mutation rate for 2-island model, 
unequal initial distributions 
: 200, \i = 0.001 , h = 0.6675, 1 000 generations, 2000 repetitions 
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Appendix A. Proof of the main result 

The individual chosen from the contemporary population descends from a unique mem- 
ber of the reference population, and this ancestor is equally likely to be any member of 
the reference population. We may therefore calculate the probability that the contempo- 
rary and reference states differ by considering two cases: the ancestral state agrees with 
the reference state, but evolves to disagree; and the ancestral state disagrees with the 
reference state, and evolves so as to remain in disagreement at time t. 

Under the r-state Jukes-Cantor model the probability that there is a net transition 
from state i to state j ^ i in time t is given by 

The probability that the ancestral state agrees with the reference state but evolves to 
disagree is therefore given by 

(6) E 7r '( r - 1 )^) = ( r - 1 )^)E^ 

i i 

while the probability that both the ancestral and contemporary states differ from the 
reference is given by 



(7) 



i j^i 
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Observe that 

\ i / i i jfa i 

so irf = 1 — h . Hence summing equations flSj) and (j7|) we get 

P(t) = (1 - h )(r - l)p(t) + h (1 - p(f)) 

= (r - l)p(t) + /io(l - p(t) - (r - l)p(t)) 
= (r - l)p(t) + h (l - rp(t)) 

r 

as claimed. 
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